library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(lme4)
library(rstatix)

options(scipen=999)
df = read.csv(here::here('avg_tensor_by_roi_wide.csv'),
              colClasses = c('subject' = 'factor',
                             'site' = 'factor',
                             'visit' = 'factor')) %>% 
  rename(scan = visit)
outcomes = df %>% 
  select(where(is.numeric)) %>% 
  colnames

Summary

The purpose of the DTIMSCAMRAS study is to test for site effects on DTI metrics of different brain regions in the context of a traveling-subjects design with a harmonized imaging protocol. Participants (n=11) with stable Multiple Sclerosis (MS) were scanned at 4 different locations: the National Institutes of Health (NIH), Brigham and Women’s Hospital (BWH), Johns Hopkins University (Hopkins), and the University of Pennsylvania (Penn). Two scans were collected per site visit, and participants were repositioned between scans. The modalities collected include T1-weighted images (MPRAGE), FLAIR, and diffusion-weighted images (DWI).

Preprocessing

To preprocess imaging data, qsiprep was used for bias correction, skull-stripping and co-registration of the T1s and DWIs, and specialized denoising and artifact correction for the DWIs. Then, DTIFIT was used to compute the following scalar maps:

  • Fractional Anisotropy
  • Mean Diffusivity
  • Axial Diffusivity
  • Radial Diffusivity

Concurrently, several segmentation pipelines were used on the T1s (+ FLAIRs in the case of MIMOSA) with the purpose of defining particular regions of interest (ROIs):

  • White Matter and Gray Matter Segmentation methods:
    • ANTs ATROPOS
    • FSL FAST
    • Multi-Atlas Segmentation with Joint Label Fusion using the MUSE templates (JLF WM and JLF GM)
  • Thalamus Segmentation
    • FSL FIRST
    • Multi-Atlas Segmentation with Joint Label Fusion using the OASIS templates (JLF THALAMUS)
  • Lesion Segmentation methods:
    • MIMOSA

The segmentation results were used to average the scalar maps across all voxels belonging to each of the ROIs. The 36 resulting metrics make up the set of outcome variables that are the focus of the site-effects tests. The outcome variables are listed below.

data.frame(OUTCOME = outcomes) %>% 
  separate(OUTCOME, c("SEGMENTATION", "ROI", "SCALAR"), remove = FALSE)

Note the general format: for example, the ATROPOS_GM_AD variable denotes the average Axial Diffusivity (AD) across the Gray Matter (GM) regions defined by the Atropos Segmentation.

Testing site effects

Permutation-Based F-test

A permutation-based F-test was used to test for site effects. For each subject, the absolute differences between all possible pairs of intra-site and inter-site measures were computed. For instance, take the first row of the subject data.frame below, which has BWH as the “reference site”: the absolute differences in \(y\) between that row and the 6 non-BWH rows are computed. This process is repeated for all rows, and the results are averaged to arrive at the average inter-site difference for this subject. For the average intra-site difference, the absolute differences of the 4 intra-site pairs are averaged.

df %>% 
  filter(subject == '01001') %>% 
  select(!where(is.numeric), ATROPOS_GM_AD) %>% 
  rename(y = ATROPOS_GM_AD)

These differences were averaged across all subjects resulting in the Mean Absolute Difference Between Sites (\(\overline{MAD_{B}}\)) and the Mean Absolute Difference Within Sites (\(\overline{MAD_{W}}\)). The test statistic is composed of their ratio:

\[ F = \frac{\overline{MAD_{B}}}{\overline{MAD_{W}}} \]
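The statistic above can be sketched for a single metric as follows. This is a minimal illustration with a hypothetical `mad_stats` helper (not the report's actual implementation), assuming a data frame with `subject`, `site`, and a numeric column `y`:

```r
# Sketch: mean absolute differences between and within sites for one metric.
# Hypothetical helper, not the report's code. Assumes columns subject, site, y.
mad_stats <- function(dat) {
  per_subject <- lapply(split(dat, dat$subject), function(s) {
    d <- abs(outer(s$y, s$y, "-"))            # all pairwise absolute differences
    same_site <- outer(s$site, s$site, "==")  # TRUE for intra-site pairs
    keep <- upper.tri(d)                      # count each unordered pair once
    c(between = mean(d[keep & !same_site]),
      within  = mean(d[keep & same_site]))
  })
  m <- colMeans(do.call(rbind, per_subject))  # average across subjects
  c(m, F = unname(m["between"] / m["within"]))
}
```

With 2 scans at each of 4 sites per subject, this yields the 4 intra-site pairs and 24 inter-site pairs described above.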

For each metric, the observed test statistic was compared to a distribution of test statistics derived from 10000 permutations (i.e. a null distribution). The proportion of null results higher than the observed test statistic served as a preliminary p-value \(p_0\). To prevent p-values of 0 (when the observed statistic was higher than all permuted results), the p-value was shifted using the following formula:

\[ p = \frac{10000*p_0 + 1}{10000 + 1}\]
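As a worked check of the shift formula (using a hypothetical `shift_p` helper for illustration):

```r
# Shift formula from above: prevents p = 0 when the observed statistic
# exceeds all n_perm permuted statistics (p0 = 0 maps to 1/(n_perm + 1)).
shift_p <- function(p0, n_perm = 10000) (n_perm * p0 + 1) / (n_perm + 1)

shift_p(0)     # smallest attainable p-value: 1/10001
shift_p(0.05)  # barely changed: 501/10001
```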

Cross-design Mixed Linear Models

As an additional test of site effects, mixed models were performed corresponding to a crossed design using the lmer function of the lme4 package. For each outcome variable, two mixed-effects models were fitted: The formula for the base model includes a random intercept for site as well as a random intercept for subject. The formula for the extended model also includes these terms in addition to a fixed effect term for site.

Base Model:

y ~ (1|subject) + (1|site)

Extended model:

y ~ site + (1|subject) + (1|site)

For each outcome, these two models were compared through a deviance test using the anova function in R. P-values for these tests are reported as a test of site effects.

Repeated-Measures ANOVA

As yet another test of site effects, a repeated-measures ANOVA was performed for each metric. Here, only data from scan 1 was used. The models were run using the anova_test function from the rstatix package. In these models, site was specified as a within-subject factor.

Key Findings

  • Permutation tests revealed significant site effects in all 36 outcomes, both before and after FDR adjustment.
  • For mixed models, most outcome variables (31 out of 36) showed a significant test before and after FDR adjustment, indicating the presence of site effects.
  • For repeated-measures ANOVA, 23 out of 36 metrics showed a significant effect of site.
  • Permutation-based F-tests were more sensitive to site effects than the other methods; repeated-measures ANOVA was the least sensitive of the three.

Questions for follow up

  • The data included in this report was not distortion-corrected for Penn, NIH, and BWH, while the Hopkins data was all distortion-corrected. Should the analysis be redone for non-Hopkins data? Note: DTI data was only collected without distortion correction [TODO: double check] in non-Hopkins data.
  • Manual Lesion segmentations are not included in this report. Should these be included in this assessment?

Data Inventory

Subjects 03001 and 03002 did not complete imaging at all 4 sites. Subject 03001 completed imaging at Penn and BWH only and 03002 completed imaging at BWH, Hopkins, and the NIH only.

Subject 02001 is missing DTI data from their NIH visit.

Scans at site per subject

t_df <- df %>% 
   count(subject, site, .drop = FALSE) %>% 
  tidyr::pivot_wider(names_from = site, values_from = n) %>% 
  mutate(`Row Totals` = BWH + Hopkins + NIH + Penn) 
t_df <- t_df %>% 
  summarize(across(where(is.integer), sum)) %>% 
  bind_rows(t_df, .) %>% 
  mutate(across(where(is.factor), as.character))
t_df[is.na(t_df)] <- "Column Totals"
t_df

For Atropos (excluding failed segmentations)

Atropos segmentation failed for 6 images: 04001NIH01, 01003NIH01, 03002NIH02, 04003NIH01, 04001BWH02, and 03001BWH01.

knitr::include_graphics(here::here(c('misc/atropos_success.png', 'misc/atropos_fail.png')))
Left: Example of a successful Atropos segmentation; Right: Example of a failed Atropos segmentation

The following table excludes the failed Atropos segmentations. Note that subject 03001 only has 3 images after excluding failed Atropos segmentations.

t_df <- df %>% 
  unite(sub, subject, site, scan, sep = '-') %>%
  filter(!sub %in% c('04001-NIH-01', '01003-NIH-01', '03002-NIH-02', '04003-NIH-01', '04001-BWH-02', '03001-BWH-01')) %>%
  separate(sub, c('subject', 'site', 'scan')) %>%
  mutate_if(is.character, as.factor) %>% 
  count(subject, site, .drop = FALSE) %>% 
  tidyr::pivot_wider(names_from = site, values_from = n) %>% 
  mutate(`Row Totals` = BWH + Hopkins + NIH + Penn)

t_df <- t_df %>% 
  summarize(across(where(is.integer), sum)) %>% 
  bind_rows(t_df, .) %>% 
  mutate(across(where(is.factor), as.character))
t_df[is.na(t_df)] <- "Column Totals"
t_df

The failed Atropos segmentations are included in the following visualizations but were excluded from the site-effects tests.

Visualizations

Fractional Anisotropy

Densities

Densities are colored by site, while the black line represents the overall density aggregated across sites. Distributions of the different sites tend to cluster together, except for JLF Thalamus.

plot_densities <- function(var){
  # .data[[var]] replaces the deprecated aes_string() for programmatic mapping
  df %>% 
    ggplot(aes(x = .data[[var]], color = site)) + 
    geom_density() + 
    stat_density(aes(x = .data[[var]]), geom = "line", color = "black") +
    theme_bw() + 
    xlab(str_replace_all(var, "_", " "))
}

vars <-df %>% 
  select(ends_with('FA')) %>% 
  colnames()
dplots <- lapply(vars, plot_densities) %>% 
  setNames(vars)
for (name in names(dplots)){
  cat("#####",  str_replace_all(name, "_", " "), "\n")
  print(dplots[[name]])
  cat("\n\n")
}
(Density plots rendered for: ATROPOS GM FA, ATROPOS WM FA, FAST GM FA, FAST WM FA, FIRST THALAMUS FA, JLF GM FA, JLF WM FA, JLF THALAMUS FA, MIMOSA LESION FA)

Box plots

Box plots show differences in medians across sites for most variables.
Additionally, note the severe outliers in the ATROPOS metrics, which correspond to the failed segmentations. For FIRST THALAMUS, subject 01002 could be considered an outlier; however, their FA values are consistent across sites and scans, suggesting this variation is meaningful and that this subject should be retained.

plot_avg_tensor_values <- function(tensor){
  df %>% 
    select(!where(is.numeric), ends_with(tensor)) %>%  # where() is required for predicate selection
    pivot_longer(ends_with(tensor), names_to = 'seg', values_to = 'average') %>% 
    separate(seg, into = c('segmentation', 'roi', 'tensor')) %>% 
    unite(segmentation, segmentation, roi, sep = " ") %>% 
    ggplot(aes(x=site, y=average)) + 
    geom_boxplot(outlier.shape = NA) + 
    geom_jitter(aes(color = subject, shape=scan), alpha=0.5) +
    facet_grid(segmentation ~ .) +
    coord_flip() +
    ggtitle(sprintf("Mean %s in ROI by site", tensor)) +
    theme_bw() + 
    scale_x_discrete(expand = c(.00000001, .00000001)) +
    xlab("") + 
    ylab("") + 
    theme(strip.text.y = element_text(angle = 0))
}

plot_avg_tensor_values('FA')

Mean Diffusivity

Densities

For mean diffusivity, distributions vary more widely across sites. In particular, MIMOSA LESION MD values are quite different for the Hopkins data.

vars <-df %>% 
  select(ends_with('MD')) %>% 
  colnames()
dplots <- lapply(vars, plot_densities) %>% 
  setNames(vars)
for (name in names(dplots)){
  cat("#####", str_replace_all(name, "_", " "), "\n")
  print(dplots[[name]])
  cat("\n\n")
}
(Density plots rendered for: ATROPOS GM MD, ATROPOS WM MD, FAST GM MD, FAST WM MD, FIRST THALAMUS MD, JLF GM MD, JLF WM MD, JLF THALAMUS MD, MIMOSA LESION MD)

Boxplot

As before, medians appear to be different across sites.

plot_avg_tensor_values('MD')

Radial Diffusivity

Densities

Radial diffusivity also shows some variation across sites (perhaps less than MD). JLF GM RD at NIH shows multiple peaks, and JLF WM RD shows skewed distributions for all sites, pointing to potential non-normality. As before, the MIMOSA metric shows a different distribution for the Hopkins data.

vars <-df %>% 
  select(ends_with('RD')) %>% 
  colnames()
dplots <- lapply(vars, plot_densities) %>% 
  setNames(vars)
for (name in names(dplots)){
  cat("#####",  str_replace_all(name, "_", " "), "\n")
  print(dplots[[name]])
  cat("\n\n")
}
(Density plots rendered for: ATROPOS GM RD, ATROPOS WM RD, FAST GM RD, FAST WM RD, FIRST THALAMUS RD, JLF GM RD, JLF WM RD, JLF THALAMUS RD, MIMOSA LESION RD)

Boxplot

As before, medians appear to be different across sites.

plot_avg_tensor_values('RD')

Axial Diffusivity

Densities

For Axial Diffusivity, Hopkins distributions show marked differences across most ROIs compared to other sites. The difference is quite pronounced for MIMOSA.

vars <-df %>% 
  select(ends_with('AD')) %>% 
  colnames()
dplots <- lapply(vars, plot_densities) %>% 
  setNames(vars)
for (name in names(dplots)){
  cat("#####",  str_replace_all(name, "_", " "), "\n")
  print(dplots[[name]])
  cat("\n\n")
}
(Density plots rendered for: ATROPOS GM AD, ATROPOS WM AD, FAST GM AD, FAST WM AD, FIRST THALAMUS AD, JLF GM AD, JLF WM AD, JLF THALAMUS AD, MIMOSA LESION AD)

Boxplot

As before, medians show differences, particularly for Hopkins.

plot_avg_tensor_values('AD')

Permutation Testing of Site Effects

# replace ATROPOS measures with NA for select images (segmentation failed)
atropos_cols <- df %>% 
  select(contains('ATROPOS')) %>% 
  colnames()
df[(df$subject == '04001' & df$site == 'NIH' & df$scan == '01') |
   (df$subject == '01003' & df$site == 'NIH' & df$scan == '01') |
   (df$subject == '03002' & df$site == 'NIH' & df$scan == '02') |
   (df$subject == '04003' & df$site == 'NIH' & df$scan == '01') |
   (df$subject == '04001' & df$site == 'BWH' & df$scan == '02') |
   (df$subject == '03001' & df$site == 'BWH' & df$scan == '01'), atropos_cols] <- NA

The permutation-based F-test was carried out as previously described. The F-statistic for each of the metrics is plotted below as a black dot, while the distribution of permuted test statistics is shown as a white violin plot. All metrics showed significant site effects.

ratio_stats = readRDS(here::here("results/avg_tensor_by_roi_wide_ratio_stat.rds"))
null_dist = readRDS(here::here("results/avg_tensor_by_roi_wide_null.rds"))
null_dist %>% 
  pivot_longer(everything()) %>% 
  ggplot(aes(x=name, y=value)) + 
  geom_violin(draw_quantiles = c(0.95)) + 
  geom_point(aes(x=name, y=value), 
             data = ratio_stats %>% 
               pivot_longer(everything())) + 
  coord_flip()

The observed values (the black dots) are shown below in table form, along with their unadjusted and FDR-adjusted p-values:

P-value of pseudo-F stat:

test_ratio_stat = function(ratio.stats, null.dists){
  
  p.vals = c()
  for (col in colnames(ratio.stats)){
    ratio.stat = ratio.stats[, col]
    null.dist = null.dists[, col]
    
    percentile = ecdf(null.dist)
    p.vals = c(p.vals, percentile(ratio.stat))
  }
  names(p.vals) = colnames(ratio.stats)
  return(p.vals)
}

p = 1 - test_ratio_stat(ratio_stats, null_dist)
p_df <- data.frame(p_val = (10000*p + 1)/(1+10000)) 

perm_df <- cbind(
  ratio_stats %>% 
    pivot_longer(everything(), names_to = 'outcome', values_to = 'observed_statistic'),
  p_df %>% 
    mutate(p_val_fdr = p.adjust(p_val, method = 'fdr'))
) 

rownames(perm_df) <- NULL
perm_df

Mixed-Effect Models

Two mixed models (a base and an extended model) were performed as described above. The p-values for these tests are arranged in increasing order below.

models <- outcomes %>% 
  lapply(function(x) lmer(reformulate("site + (1|subject) + (1|site)", x), data=df)) %>% 
  setNames(outcomes)
models_0 <- outcomes %>% 
  lapply(function(x) lmer(reformulate("(1|subject) + (1|site)", x), data=df)) %>% 
  setNames(outcomes)
anovas <- purrr::map2(models, models_0, ~ anova(.x, .y)) # deviance test between the two models
# pull p.values from all tests
p_vals <- anovas %>% 
  purrr::map(~ broom::tidy(.x) %>%
               pull(p.value) %>% 
               na.omit()) %>% 
  purrr::reduce(c)

anova_test_df <- data.frame(outcome = names(anovas), p_val = p_vals) %>% 
  arrange(p_val) %>% 
  mutate(p_val_fdr = p.adjust(p_val, method = 'fdr'))
anova_test_df

Note that, prior to FDR-adjustment, most outcome variables show a significant test indicating the presence of site effects. However, after FDR-adjustment, the vast majority of tests become non-significant.

Repeated-Measures ANOVA

Repeated-measures ANOVAs were performed as described above. The table below shows the F-statistic, effect size (generalized eta squared), and p-values (unadjusted and FDR-adjusted).

options(scipen=999)
find_icc <- function(data){
  
  irr::icc(data,
           model = "twoway", 
           type = "agreement",
           unit = "single")
  
}

run_anova <- function (var){
  df %>% 
    filter(scan == '01') %>% 
    anova_test(dv = var, wid = subject, within = site) %>% 
    get_anova_table()
}

anovas <- outcomes %>% 
  lapply(run_anova) %>% 
  setNames(outcomes)


data.frame(measure = names(anovas), 
           F_stat = anovas %>% 
             purrr::map(~ .x$F) %>% 
             purrr::reduce(c),
           ges = anovas %>% 
             purrr::map(~ .x$ges) %>% 
             purrr::reduce(c),
           p = anovas %>% 
             purrr::map(~ .x$p) %>% 
             purrr::reduce(c)) %>% 
  arrange(p) %>% 
  mutate(p_val_fdr = p.adjust(p, method = "fdr"))

ICC between scan 1 and scan 2 was calculated for each site. Results are shown below. Most sites show high agreement between scans across metrics.

pair_df_list <- df %>% 
  pivot_longer(where(is.numeric)) %>% 
  split(list(.$site, .$name))

# compute names for each pair_df
df_names <- pair_df_list %>% 
  purrr::map(~ .x %>% 
               transmute(df_name = paste(site, name, sep="-")) %>% 
               pull() %>% 
               unique()
  ) %>% 
  purrr::reduce(c)
# finish transformation
pair_df_list <- pair_df_list %>% 
  purrr::map(~ .x %>% 
               pivot_wider(names_from = c('name', 'scan')) %>% 
               select(where(is.numeric))
  ) %>% 
  setNames(df_names)

icc_df <- pair_df_list %>% 
  purrr::map(find_icc) %>% 
  purrr::imap(~ data.frame(outcome = .y, 
                           icc = .x$value, 
                           lwr = .x$lbound,
                           upr = .x$ubound,
                           p_val = .x$p.value)) %>% 
  dplyr::bind_rows(.id = 'outcome') %>% 
  separate(outcome, into = c('site', 'outcome'), sep='-')
options(scipen=999)
icc_df